[WIP] Phrases: make any2utf8 optional #1413
Conversation
What is the memory impact of this change? The conversion was there for a reason, IIRC.

@piskvorky yes, we had a discussion here regarding this. This PR will be updated accordingly asap.
gensim/models/phrases.py (Outdated)

@@ -169,7 +169,9 @@ def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000):
    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    sentence = [w for w in (utils.any2utf8(u'_'.join(sentence)).split('_'))]
A few issues here:

- You are trying to split a bytestring (the result of the `any2utf8` call) by `'_'` - this will not work in Python 3+, because literal strings are unicode by default (see the sketch after this list). You've faced similar problems previously, so I think it would be helpful to understand character encodings at a conceptual level, and the differences between string handling in Python 2 and 3.
- Simply `sentence = utils.any2utf8(u'_'.join(sentence)).split('_')` would be enough - no need for the extra `[w for w in ...]`.
- We're not accounting for the possibility that a word in the sentence contains `'_'` here - it would be wrong to make implicit assumptions like these about user input, unless there was an explicit constraint in the API. Escaping could be an option - although I'm not sure it is feasible, performance-wise.
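A minimal sketch of the first point, under Python 3 string semantics (illustration only, not code from the PR):

```python
# The result of a utf8 conversion is a bytes object; in Python 3, bytes.split()
# refuses a str (unicode) delimiter.
joined = u'new_york_city'.encode('utf8')   # stand-in for utils.any2utf8(u'_'.join(sentence))

try:
    joined.split('_')                      # unicode delimiter on a bytes object
except TypeError as err:
    print(err)                             # e.g. "a bytes-like object is required, not 'str'"

print(joined.split(b'_'))                  # works: [b'new', b'york', b'city']
```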
gensim/models/phrases.py (Outdated)

    # sentence = [utils.any2utf8(w) for w in sentence]
    # Unicode tokens in dictionary (not utf8)
    sentence = [w for w in (utils.any2utf8('_'.join(w for w in sentence)).split('_'))]
This should all be in the optimized path (C / Cython), so there's no point wasting time playing with Python calls. The utf8 conversion overhead will be ~0, once this is optimized properly.

You can check with `cython -a` to see how much slow (Python) code is still left on the critical path. There should be none.

This whole style of optimization is not desirable. Optimize the code properly, by writing in low-level C/Cython, not by shuffling Python calls around, joining characters in Python or whatnot. That's not the way to gain a significant speed boost.
@piskvorky Some notes on the wider context of this work. Unfortunately, re-writing code in C is outside the scope of the June evaluation milestone in Prakhar's GSoC proposal submitted in March. Another part of this proposal is selecting a multi-thread/multi-process architecture for Phrases - he is running experiments for it in the joblib PR. Once the proper benchmarks for the

These minor improvements to Phrases have been a good GSoC learning experience for Prakhar, in preparation for the FastText performance optimisation which is the main focus of his GSoC project.

I agree: even if we follow a different approach in the future, I think the current changes are worthwhile, as they do improve the times for Phrases significantly (of course, we need clearer benchmarks, but from the initial results it certainly seems so).

I disagree. This may be a good exercise for @prakhar2b, in preparation for fastText (like @tmylk says), but these utf8 changes obfuscate the code and are not the type of changes that the

Curious to see the benchmarks :)
gensim/models/phrases.py (Outdated)

    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    if isinstance(sentence[0], bytes):
What happens if `sentence` is empty?
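A small standalone illustration of this concern (not the PR's code): with an empty sentence, indexing the first token raises before `isinstance()` ever runs.

```python
# An empty sentence has no sentence[0], so the type check itself blows up.
sentence = []
try:
    is_bytes = isinstance(sentence[0], bytes)
except IndexError:
    print("empty sentence: sentence[0] does not exist")
```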
gensim/models/phrases.py (Outdated)

    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    if isinstance(sentence[0], bytes):
        sentence = [w for w in (utils.any2utf8(b';'.join(sentence)).split(b';'))]
What happens if the sentence tokens contain `;`?
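A small standalone illustration of the concern (not the PR's code): joining on `b';'` and splitting again is not a round-trip when a token itself contains the delimiter.

```python
sentence = [b'rock', b'n;roll']                 # the second token contains ';'
roundtrip = b';'.join(sentence).split(b';')
print(roundtrip)                                # [b'rock', b'n', b'roll'] -- one token became two
assert roundtrip != sentence
```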
gensim/models/phrases.py (Outdated)

@@ -133,19 +133,32 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
    `delimiter` is the glue character used to join collocation tokens, and
    should be a byte string (e.g. b'_').

    `recode_to_utf8` is an optional parameter (default True) for any2utf8 conversion of input sentences
How about:

By default, the input sentences will be internally encoded to UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set `recode_to_utf8=False` to skip this recoding step in case you don't care about memory or if your sentences are already bytestrings. This will result in much faster training (~2x faster).
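A hedged usage sketch of the parameter as proposed in this PR (the `recode_to_utf8` keyword is this PR's proposal, not part of released gensim at the time of this review):

```python
from gensim.models import Phrases

sentences = [[u'new', u'york', u'city'], [u'new', u'york', u'times']]

# default behaviour: tokens are recoded to utf8 bytestrings internally
bigram_default = Phrases(sentences, min_count=1, threshold=1)

# proposed option: skip the recoding step; per the PR this trades the memory /
# valid-utf8 guarantees for roughly 2x faster vocab building
bigram_raw = Phrases(sentences, min_count=1, threshold=1, recode_to_utf8=False)
```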
gensim/models/phrases.py (Outdated)

    """
    if min_count <= 0:
        raise ValueError("min_count should be at least 1")
        min_count = 1
Remove these two checks.
gensim/models/phrases.py (Outdated)

    self.min_count = min_count
    self.threshold = threshold
    self.max_vocab_size = max_vocab_size
    self.vocab = defaultdict(int)  # mapping between utf8 token => its count
    self.min_reduce = 1  # ignore any tokens with count smaller than this
    self.delimiter = delimiter
    self.is_bytes = True  # for storing encoding type in vocab for supporting both unicode and bytestring input
Comment hard to understand, and I think it's because the logic is not clear. What is it that is bytes in `is_bytes`? Isn't it better to simply create a flag of whether the input sequences are bytes or not? Then the comment becomes `# do the input sentences consist of bytestrings?`, which is clear.

Also, this check seems to belong to `learn_vocab`, not here.
EDIT: ok, I think `self.is_input_bytes` should be ok.

Adding this comment (and I think it's rightly in init):

self.is_bytes = True  # do the input sentences consist of bytestrings?
# With default (recode_to_utf8=True) we encode input sentences to utf8 bytestrings, but
# with recode_to_utf8=False, we retain encoding, so need to store this encoding
# information to later convert token inputs accordingly (in __getitem__ and export_phrases)
If it's rightly in init, then what happens if `sentences=None` and the user is calling `learn_vocab` manually later?
@piskvorky updated the PR
gensim/models/phrases.py (Outdated)

@@ -133,19 +133,24 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
    `delimiter` is the glue character used to join collocation tokens, and
    should be a byte string (e.g. b'_').

    `recode_to_utf8` - By default, the input sentences will be internally encoded to
    UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
Best to render code blocks as literal text (put `recode_to_utf8=False` in backticks).
gensim/models/phrases.py (Outdated)

    `recode_to_utf8` - By default, the input sentences will be internally encoded to
    UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
    to skip this recoding step in case you don't care about memory or if your sentences
    are already bytestrings. This will result in much faster training (~2x faster)
Missing full stop at the end of the sentence.
gensim/models/phrases.py (Outdated)

    if self.recode_to_utf8:
        s = [utils.any2utf8(w) for w in sentence]
    else:
        s = [utils.any2utf8(w) for w in sentence] if self.is_input_bytes else list(sentence)
Looks like a bug -- recode is False but recoding happens.
@piskvorky is it true that only unicode strings are accepted (and not bytestrings) as input tokens in `__getitem__` or `export_phrases`, as mentioned here in phrases.py?
@piskvorky this conversion was apparently meant to handle the mismatch between bytestring training input and unicode token input. Is this an unprecedented use case, or should we just raise a warning (or a TypeError for differently encoded training and token input)?
Sorry, I don't understand. If the user said "I don't want recoding", we shouldn't be recoding.
What is the motivation for this?
Yes, we are not doing the recoding while training Phrases.

For example, with `recode_to_utf8=False` and unicode input sentences (for training), we have unicode words in the vocab. Now if we provide bytestring tokens in `__getitem__` or `export_phrases`, this mismatch will be a problem here in phrases.py (or vice-versa).

For no recoding at all, I think the user will have to use the same encoding for both training and phrase retrieval (`__getitem__` / `export_phrases`).
@jayantj @gojomo @menshikh-iv ref - above comment
cc @piskvorky
I think we shouldn't be recoding implicitly in `__getitem__`. The only gain is in the case when `learn_vocab` receives bytestrings, `recode_to_utf8` is `False` and `__getitem__` receives unicode. I don't think that justifies the dangerous subtle errors that it could cause.

Also, if we're going to be implicitly handling different input formats for `learn_vocab` vs `__getitem__`, we're also going to have to take care of the delimiter here.
I think the delimiter issue is sorted out in `learn_vocab` here. Just for information, what could those dangerous errors be? The idea is simply to convert the incoming token in `__getitem__` to match the encoding in the vocab.
As @piskvorky mentioned, suppose we had `recode_to_utf8=False`, `learn_vocab` was called with latin2 bytestrings, and latin2 bytestrings are then sent to `__getitem__`. We'd silently recode the latin2 bytestrings to utf8 in `__getitem__` and the lookup would fail, even though it should succeed.
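A self-contained sketch of that failure mode (plain Python, not gensim code), assuming a vocab built from latin-2 bytestrings with `recode_to_utf8=False`:

```python
word = u"žluť"                              # Czech word with non-ASCII characters
latin2_key = word.encode("iso-8859-2")      # the key as learn_vocab would store it
vocab = {latin2_key: 42}

# A silent "recode to utf8" step on lookup produces a different byte sequence
# (the decode uses latin-1 here purely so the demo doesn't raise a decode error).
recoded_key = latin2_key.decode("latin-1").encode("utf-8")

assert latin2_key in vocab                  # the untouched key would have matched
assert recoded_key not in vocab             # the recoded key silently misses
```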
OK, yes, I understand. I think failing explicitly should be the correct choice then (for a mismatch between the training input and the infer input with `recode_to_utf8=False`). What kind of error should we raise for this mismatch?
gensim/models/phrases.py (Outdated)

    """Collect unigram/bigram counts from the `sentences` iterable."""
    if not self.recode_to_utf8 and sentences is not None:
        sentence = list(next(iter(sentences)))
If I understand correctly, this will raise an exception if sentences is either an empty list or an empty generator.
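A quick standalone demonstration of that edge case (assuming the peek uses `next(iter(...))` as in the diff above):

```python
# Both an empty list and an empty generator raise StopIteration on the first peek.
for sentences in ([], (s for s in [])):
    try:
        first = list(next(iter(sentences)))
    except StopIteration:
        print("no first sentence to inspect:", type(sentences).__name__)
```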
In that case, ideally a simple test should be catching this too.
"source": [ | ||
"# currently on develop --- original code\n", | ||
"from gensim.models import Phrases\n", | ||
"bigram = Phrases(Text8Corpus(text8_file))\n", |
Better convert the streamed iterable to an in-memory list (using `list()`), it's small enough. That way we don't have to iterate over the file from disk every time. This will make the benchmark conclusions stronger (less noise and delays from other, unrelated parts of the code, IO overhead etc.).
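A hedged sketch of that suggestion (`text8_file` is whatever path the benchmark notebook already defines):

```python
from gensim.models import Phrases
from gensim.models.word2vec import Text8Corpus

sentences = list(Text8Corpus(text8_file))   # stream from disk exactly once
bigram = Phrases(sentences)                 # time this call on the in-memory list
```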
gensim/models/phrases.py (Outdated)

    try:
        sentence = list(next(iter(sentences)))
    except:
        raise ValueError("Input can not be empty list or generator.")
-1: why should empty input suddenly become a special case, raising an exception?
This concern was raised here in the review comment above, and also as discussed on gitter: `sentences=[]` is not None, but it will throw an error at the `next(iter(...))` call in the case above.
Yes, but that's a reason to fix the bug, not change the API :)
Oh, I should have just logged a warning ("Empty sentences provided as input") in the except block, with no need to raise the error.
Unless it's really a special case (and I don't think it is), there's no need to treat it in a special way.
No exception, no warning -- there's nothing special about an empty corpus with regard to Phrases, except your new check for its first element. It's not a special case.
@prakhar2b can you please fix the two issues I pointed out last week?
gensim/models/phrases.py (Outdated)

            self.delimiter = utils.to_unicode(self.delimiter)
            self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
Using catch-all except blocks is generally a bad idea, since you could end up catching unexpected exceptions. So catch only the specific exception expected here (`StopIteration`?).
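A hedged sketch of that suggestion (the helper name and structure are illustrative, not the PR's code): peek at the first sentence while catching only the exception an exhausted iterator is expected to raise.

```python
import itertools as it

def peek_first_sentence(sentences):
    """Return (first_sentence, rechained_iterable); first_sentence is None if empty."""
    iterator = iter(sentences)
    try:
        first = next(iterator)
    except StopIteration:               # empty list or empty generator -- not a special case
        return None, sentences
    return first, it.chain([first], iterator)
```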
gensim/models/phrases.py (Outdated)

        self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
        # No need to raise any exception or log any warning, as it's not a special case.
A debug message here would serve well too.
I have a question regarding the behaviour of Phrases with

If we do I added this test (for bytestrings infer input) for
Not sure how that comes about, but it looks like a bug to me @prakhar2b (not intended behaviour).

What's the status, @prakhar2b?

@menshikh-iv should I reopen my #1454, since this was closed?

@piskvorky I think not; Filip now works on #1446, as I understand it.

@menshikh-iv #1446 is completely orthogonal to this. You can convert to utf8 with or without a memory-bounded counter.
Updated benchmark (for text8)

Speed improvement in the Phrases module.

Phrases optimization benchmark (for text8)

Note: leaving this here for the context of prior discussions.